Abstract. We describe a data structure that maintains the number of triangles in
a dynamic undirected graph, subject to insertions and deletions of edges and of
degree-zero vertices. More generally it can be used to maintain the number of
copies of each possible three-vertex subgraph in time $O(h)$ per update, where $h$
is the h-index of the graph, the maximum number such that the graph contains $h$
vertices of degree at least $h$. We also show how to maintain the h-index itself, and
a collection of $h$ high-degree vertices in the graph, in constant time per update.
Our data structure has applications in social network analysis using the exponential
random graph model (ERGM); its bound of $O(h)$ time per edge is never worse
than the $\Theta(\sqrt{m})$ time per edge necessary to list all triangles in a static graph, and
is strictly better for graphs obeying a power law degree distribution. In order to
better understand the behavior of the h-index statistic and its implications for the
performance of our algorithms, we also study the behavior of the h-index on a set
of 136 real-world networks.

1 Introduction 

The exponential random graph model (ERGM, or p* model) [17, 29, 34] is a general
technique for assigning probabilities to graphs that can be used both to generate
simulated data for social network analysis and to perform probabilistic reasoning on
real-world data. In this model, one fixes the vertex set of a graph, identifies certain
features $f_i$ in graphs on that vertex set, determines a weight $w_i$ for each feature, and
sets the probability of each graph $G$ to be proportional to an exponential function of the
sum of its features' weights, divided by a normalizing constant $Z$:

$$\Pr(G) = \frac{\exp \sum_{f_i \in G} w_i}{Z}.$$

$Z$ is found by summing over all graphs on that vertex set:

$$Z = \sum_{G} \exp \sum_{f_i \in G} w_i.$$

For instance, if each potential edge is considered to be a feature and all edges have
weight $\ln\frac{p}{1-p}$, the normalizing constant $Z$ will be $(1-p)^{-n(n-1)/2}$, and the
probability of any particular $m$-edge graph will be $p^m (1-p)^{n(n-1)/2-m}$, giving rise to
the familiar Erdős-Rényi $G(n,p)$ model. However, the ERG model is much more general than
the Erdős-Rényi model: for instance, an ERGM in which the features are whole graphs
can represent arbitrary probabilities. The generality of this model, and its ability to
define probability spaces lacking the independence properties of the simpler Erdős-Rényi
model, make it difficult to analyze analytically. Instead, in order to generate graphs in
an ERG model or to perform other forms of probabilistic reasoning with the model, one
typically uses a Markov Chain Monte Carlo method [30] in which one performs a large
sequence of small changes to sample graphs, updates after each change the counts of
the number of features of each type and the sum of the weights of each feature, and uses
the updated values to determine whether to accept or reject each change. Because this
method must evaluate large numbers of graphs, it is important to develop very efficient
algorithms for identifying the features that are present in each graph.
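
The Erdős-Rényi example earlier in this section can be checked by brute force on a
tiny vertex set. The following Python sketch is an illustration only: the function name
`ergm_edge_model` and the values n = 4 and p = 0.3 are assumptions made for the
example, and enumerating all graphs is feasible only for very small n.

```python
from itertools import combinations, product
from math import exp, log, isclose

def ergm_edge_model(n, p):
    """Brute-force the ERGM whose only features are edges of weight ln(p/(1-p))."""
    pairs = list(combinations(range(n), 2))      # all potential edges
    w = log(p / (1 - p))                         # common edge weight
    # Each graph on n labelled vertices is a 0/1 choice for every pair.
    graphs = list(product((0, 1), repeat=len(pairs)))
    z = sum(exp(w * sum(g)) for g in graphs)     # normalizing constant Z
    probs = {g: exp(w * sum(g)) / z for g in graphs}
    return z, probs

n, p = 4, 0.3                                    # illustrative values
z, probs = ergm_edge_model(n, p)
N = n * (n - 1) // 2                             # number of potential edges

# Compare with the closed forms Z = (1-p)^(-N) and Pr(G) = p^m (1-p)^(N-m).
print(isclose(z, (1 - p) ** (-N)))
print(all(isclose(pr, p ** sum(g) * (1 - p) ** (N - sum(g)))
          for g, pr in probs.items()))
```

Both comparisons use floating-point tolerance through `math.isclose` rather than exact
equality.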

Typical features used in these models take the form of small subgraphs: stars of
several edges with a common vertex (used to represent constraints on the degree
distribution of the resulting graphs), triangles (used in the triad model [18], an important
predecessor of ERG models, to represent the likelihood that friends-of-friends are friends
of each other), and more complicated subgraphs used to control the tendencies of
simpler models to generate unrealistically extremal graphs [31]. Using highly local features
of this type is important for reasons of computational efficiency, matches well the type
of data that can be obtained for real-world social networks, and is well motivated by
the local processes believed to underlie many types of social network. Thus, ERGM
simulation leads naturally to problems of subgraph isomorphism, listing or counting all
copies of a given small subgraph in a larger graph.

There has been much past algorithmic work on subgraph isomorphism problems. It
is known, for instance, that an $n$-vertex graph with $m$ edges may have $O(m^{3/2})$ triangles
and four-cycles, and all triangles and four-cycles can be found in time $O(m^{3/2})$ [6, 21].
All cycles of length up to seven can be counted rather than listed in time $O(n^{\omega})$ [3],
where $\omega \approx 2.376$ is the exponent from the asymptotically fastest known matrix
multiplication algorithms [7]; this improves on the previous $O(m^{3/2})$ bounds for dense graphs.
Fast matrix multiplication has also been used for more general problems of finding and
counting small cliques in graphs and hypergraphs [10, 23, 25, 33, 35]. In planar graphs,
or more generally graphs of bounded local treewidth, the number of copies of any fixed
subgraph may be found in linear time [13, 14], even though this number may be a
large polynomial of the graph size [11]. Approximation algorithms for subgraph
isomorphism counting problems based on random sampling have also been studied, with
motivating applications in bioinformatics [9, 22, 28]. However, much of this subgraph
isomorphism research makes overly restrictive assumptions about the graphs that are
allowed as input, runs too slowly for the ERGM application, depends on impractically
complicated matrix multiplication algorithms, or does not capture the precise subgraph
counts needed to accurately perform Markov Chain Monte Carlo simulations.

Markov Chain Monte Carlo methods for ERGM-based reasoning process a sequence
of graphs each differing by a small change from a previous graph, so it is natural
to seek additional efficiency by applying dynamic graph algorithms [15, 16, 32], data
structures to efficiently maintain properties of a graph subject to vertex and edge
insertions and deletions. However, past research on dynamic graph algorithms has focused
on problems of connectivity, planarity, and shortest paths, and not on finding the
features needed in ERGM calculations. In this paper, we apply dynamic graph algorithms
to subgraph isomorphism problems important in ERGM feature identification. To our
knowledge, this is the first work on dynamic algorithms for subgraph isomorphism.

A key ingredient in our algorithms is the h-index, a number introduced by Hirsch [20]
as a way of balancing prolixity and impact in measuring the academic achievements of
individual researchers. Although problematic in this application [1], the h-index can
be defined and studied mathematically, in graph-theoretic terms, and provides a
convenient measure of the uniformity of distribution of edges in a graph. Specifically, for
a researcher, one may define a bipartite graph in which the vertices on one side of the
bipartition represent the researcher's papers, the vertices on the other side represent
others' papers, and edges correspond to citations by others of the researcher's papers. The
h-index of the researcher is the maximum number $h$ such that at least $h$ vertices on the
researcher's side of the bipartition each have degree at least $h$. We generalize this to
arbitrary graphs, and define the h-index of any graph to be the maximum $h$ such that the
graph contains $h$ vertices of degree at least $h$. Intuitively, an algorithm whose running
time is bounded by a function of $h$ is capable of tolerating arbitrarily many low-degree
vertices without slowdown, and is only mildly affected by the presence of a small
number of very high degree vertices; its running time depends primarily on the numbers of
intermediate-degree vertices. As we describe in more detail in Section 7, the h-index of
any graph with $m$ edges and $n$ vertices is sandwiched between $m/n$ and $\sqrt{2m}$, so it is
sublinear whenever the graph is not dense, and the worst-case graphs for these bounds
have an unusual degree distribution that is unlikely to arise in practice.
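
The definition of the h-index of a graph can be made concrete with a short static
computation. The Python sketch below is illustrative and not the dynamic structure
developed later; the function name `graph_h_index` and the two sample graphs are
assumptions made for the example.

```python
from math import sqrt

def graph_h_index(edges):
    """h-index of a simple undirected graph given as an iterable of (u, v) pairs."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    ranked = sorted(degree.values(), reverse=True)
    # "the i-th largest degree is at least i" holds for a prefix of the ranks;
    # the length of that prefix is the h-index.
    return sum(1 for rank, d in enumerate(ranked, start=1) if d >= rank)

star = [(0, i) for i in range(1, 5)]        # one hub with four leaves
triangle = [(0, 1), (1, 2), (0, 2)]

for name, edges in (("star", star), ("triangle", triangle)):
    n = len({x for e in edges for x in e})  # vertices touched by an edge
    m = len(edges)
    h = graph_h_index(edges)
    # Report h and whether it lies between m/n and sqrt(2m).
    print(name, h, m / n <= h <= sqrt(2 * m))
```

Isolated vertices have degree zero and so never change the result, which is why the
sketch only looks at vertices that appear in some edge.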

Our main result is that we may maintain a dynamic graph, subject to edge insertions,
edge deletions, and insertions or deletions of isolated vertices, and maintain the number
of triangles in the graph, in time $O(h)$ per update where $h$ is the h-index of the graph
at the time of the update. This compares favorably with the time bound of $O(m^{3/2})$
necessary to list all triangles in a static graph. In the same $O(h)$ time bound per update
we may more generally maintain the numbers of three-vertex induced subgraphs of each
possible type, and in constant time per update we may maintain the h-index itself. Our
algorithms are randomized, and our analysis of them uses amortized analysis to bound
their expected times on worst-case input sequences. Our use of randomization is limited,
however, to the use of hash tables to store and retrieve data associated with keys in $O(1)$
expected time per access. By using either direct addressing or deterministic integer
searching data structures instead of hash tables we may avoid the use of randomness at
an expense of either increased space complexity or an additional factor of $O(\log\log n)$
in time complexity; we omit the details.

We also study the behavior of the h-index, both on scale-free graph models and on
a set of real-world graphs used in social network analysis. We show that for scale-free
graphs, the h-index scales as a power of $n$, less than its square root, while in the
real-world graphs we studied the scaling exponent appears to have a bimodal distribution.

2 Dynamic h-Indexes of Integer Functions

We begin by describing a data structure for the following problem, which generalizes
that of maintaining h-indexes of dynamic graphs. We are given a set $S$, and a function
$f$ from $S$ to the non-negative integers, both of which may vary discretely through a
sequence of updates: we may insert or delete elements of $S$ (with arbitrary function
values for the inserted elements), and we may make arbitrary changes to the function
value of any element of $S$. As we do so, we wish to maintain a set $H$ such that, for
every $x \in H$, $f(x) \ge |H|$, with $H$ as large as possible with this property. We call $|H|$
the h-index of $S$ and $f$, and we call the partition of $S$ into the two subsets $(H, S \setminus H)$ an
h-partition of $S$ and $f$.

To do so, we maintain the following data structures: 

- A dictionary $F$ mapping each $x \in S$ to its value under $f$: $F[x] = f(x)$.

- The set $H$ (stored as a dictionary mapping members of $H$ to an arbitrary value).

- The set $B = \{x \in H \mid f(x) = |H|\}$.

- A dictionary $C$ mapping each non-negative integer $i$ to the set $\{x \in S \setminus B \mid f(x) = i\}$.
We only store these sets when they are non-empty, so the situation that there is no
$x$ with $f(x) = i$ can be detected by the absence of $i$ among the keys of $C$.

To insert an element $x$ into our structure, we first set $F[x] = f(x)$, and add $x$ to
$C[f(x)]$ (or add a new set $\{x\}$ at $C[f(x)]$ if there is no existing entry for $f(x)$ in $C$).
Then, we test whether $f(x) > |H|$. If not, the h-index does not change, and the insertion
operation is complete. But if $f(x) > |H|$, we must include $x$ into $H$. If $B$ is nonempty, we
choose an arbitrary $y \in B$, remove $y$ from $B$ and from $H$, and add $y$ to $C[|H|]$ (or create
a new set $\{y\}$ if there is no entry for $|H|$ in $C$). Finally, if $f(x) > |H|$ and $B$ is empty, the
insertion causes the h-index ($|H|$) to increase by one. In this case, we test whether there
is an entry for the new value of $|H|$ in $C$. If so, we set $B$ to equal the identity of the set
in $C[|H|]$ and delete the entry for $|H|$ in $C$; otherwise, we set $B$ to the empty set.

To remove $x$ from our structure, we remove its entry from $F$ and we remove it from $B$
(if it belongs there) or from the appropriate set in $C[f(x)]$ otherwise. If $x$ did not belong
to $H$, the h-index does not change, and the deletion operation is complete. Otherwise, let
$h$ be the value of $|H|$ before removing $x$. We remove $x$ from $H$, and attempt to restore the
lost item from $H$ by moving an element from $C[h]$ to $B$ (deleting $C[h]$ if this operation
causes it to become empty). But if $C$ has no entry for $h$, the h-index decreases; in this
case we store the identity of set $B$ into $C[h]$, and set $B$ to be the empty set.

Changing the value of $f(x)$ may be accomplished by deleting $x$ and then reinserting
it, with some care so that we do not update $H$ if $x$ was already in $H$ and both the old and
new values of $f(x)$ are at least equal to $|H|$.

Theorem 1. The data structure described above maintains the h-index of $S$ and $f$, and
an h-partition of $S$ and $f$, in constant time plus a constant number of dictionary
operations per update.

We defer the proof to an appendix. 
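
The insertion and deletion procedures of this section can be sketched in Python as
follows. This is a minimal sketch under stated assumptions rather than a definitive
implementation: the class name `DynamicHIndex` and its method names are hypothetical,
Python sets and dicts stand in for the dictionaries $F$, $H$, $B$ and $C$, and a value
change is done by plain deletion and reinsertion, without the extra care described above
for avoiding unnecessary changes to $H$.

```python
class DynamicHIndex:
    """Maintains F, H, B and C as described in this section (illustrative sketch)."""

    def __init__(self):
        self.F = {}       # element -> current value f(x)
        self.H = set()    # the maintained set H; |H| is the h-index
        self.B = set()    # members of H whose value equals |H|
        self.C = {}       # value i -> set of elements outside B with f(x) = i

    def h_index(self):
        return len(self.H)

    def _c_add(self, i, x):
        self.C.setdefault(i, set()).add(x)

    def _c_remove(self, i, x):
        bucket = self.C[i]
        bucket.remove(x)
        if not bucket:                # only non-empty sets are stored
            del self.C[i]

    def insert(self, x, value):
        """Insert a new element x (assumed absent) with f(x) = value."""
        self.F[x] = value
        self._c_add(value, x)
        h = len(self.H)
        if value <= h:
            return                    # h-index unchanged
        self.H.add(x)
        if self.B:
            y = self.B.pop()          # evict one element whose value equals h
            self.H.remove(y)
            self._c_add(h, y)
        else:
            # The h-index grows to h + 1; C[h + 1] becomes the new B.
            self.B = self.C.pop(h + 1, set())

    def delete(self, x):
        """Delete an element x that is currently present."""
        value = self.F.pop(x)
        if x in self.B:
            self.B.remove(x)
        else:
            self._c_remove(value, x)
        if x not in self.H:
            return                    # h-index unchanged
        h = len(self.H)               # value of |H| before removing x
        self.H.remove(x)
        if h in self.C:
            y = next(iter(self.C[h])) # refill H from C[h]
            self._c_remove(h, y)
            self.B.add(y)
            self.H.add(y)
        else:
            # The h-index drops to h - 1; the old B is filed under C[h].
            if self.B:
                self.C[h] = self.B
            self.B = set()

    def update(self, x, value):
        """Change f(x) by deleting and reinserting (simplified)."""
        self.delete(x)
        self.insert(x, value)


d = DynamicHIndex()
for name, value in (("a", 1), ("b", 1), ("c", 5), ("d", 2)):
    d.insert(name, value)
print(d.h_index(), sorted(d.H))
```

Each operation touches a constant number of set and dict entries, matching the shape of
Theorem 1 when hash tables give expected constant-time access.
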

3 Gradual Approximate h-Partitions

Although the vector h-index data structure of the previous section allows us to maintain
the h-index of a dynamic graph very efficiently, it has a property that would be
undesirable were we to use it directly as part of our later dynamic graph data structures:
the h-partition $(H, S \setminus H)$ changes too frequently. Changes to the set $H$ will turn out to
be such an expensive operation that we only wish them to happen, on average, $O(1/h)$
times per update. In order to achieve such a small amount of change to $H$, we need to
restrict the set of updates that are allowed: now, rather than arbitrary changes to $f$, we
only allow it to be incremented or decremented by a single unit, and we only allow an
element $x$ to be inserted or deleted when $f(x) = 0$. We now describe a modification of
the h-partition data structure that has this property of changing more gradually for this
restricted class of updates.

Specifically, along with all of the structures of the h-partition, we maintain a set
$P \subset H$ describing a partition $(P, S \setminus P)$. When an element $x$ is removed from $H$, we
remove it from $P$ as well, to maintain the invariant that $P \subset H$. However, we only add
an element $x$ to $P$ when an update (an increment of $f(x)$ or decrement of $f(y)$ for some
other element $y$) causes $f(x)$ to become greater than or equal to $2|H|$. The elements to
be added to $P$ on each update may be found by maintaining a dictionary, parallel to $C$,
that maps each integer $i$ to the set $\{x \in H \setminus P \mid f(x) = i\}$.

Theorem 2. Let $\sigma$ denote a sequence of operations to the data structure described
above, starting from an empty data structure. Let $h_t$ denote the value of $h$ after $t$
operations, and let $q = \sum_i 1/h_i$. Then the data structure undergoes $O(q)$ additions and
removals of an element to or from $P$.

We defer the proof to an appendix. For our later application of this technique as
a subroutine in our triangle-finding data structure, we will need a more local analysis.
We may divide a sequence of updates into epochs, as follows: each epoch begins when
the h-index reaches a value that differs from the value at the beginning of the previous
epoch by a factor of two or more. Then, by Lemma 1, an epoch with $h$ as its initial
h-index lasts for at least $\Omega(h^2)$ steps. Due to this length, we may assign a full unit of
credit to each member of $P$ at the start of each epoch, without changing the asymptotic
behavior of the total number of credits assigned over the course of the algorithm. With
this modification, it follows from the same analysis as above that, within an epoch of $s$
steps, with an h-index of $h$ at the start of the epoch, there are $O(s/h)$ changes to $P$.

4 Counting Triangles 

We are now ready to describe our data structure for maintaining the number of triangles 
in a dynamic graph. It consists of the following information: 

- A count of the number of triangles in the current graph 

- A set $E$ of the edges in the graph, indexed by the pair of endpoints of the edge,
allowing constant-time tests for whether a given pair of endpoints are linked by an
edge.

- A partition of the graph vertices into two sets $H$ and $V \setminus H$ as maintained by the
data structure from Section 3.

- A dictionary $P$ mapping each pair of vertices $u, v$ to a number $P[u, v]$, the number of
two-edge paths from $u$ to $v$ via a vertex of $V \setminus H$. We only maintain nonzero values
for this number in $P$; if there is no entry in $P$ for the pair $u, v$ then there exist no
two-edge paths via $V \setminus H$ that connect $u$ to $v$.

Theorem 3. The data structure described above requires space $O(mh)$ and may be
maintained in $O(h)$ randomized amortized time per operation, where $h$ is the h-index of
the graph at the time of the operation.

Proof. Insertion and deletion of vertices with no incident edges require no change to
most of these data structures, so we concentrate our description on the edge insertion
and deletion operations.

To update the count of triangles, we need to know the number of triangles $uvw$
involving the edge $uv$ that is being deleted or inserted. Triangles in which the third
vertex $w$ belongs to $H$ may be found in time $O(h)$ by testing all members of $H$, using
the data structure for $E$ to test in constant time per member whether it forms a triangle.
Triangles in which the third vertex $w$ does not belong to $H$ may be counted in time $O(1)$
by a single lookup in $P$.

The data structure for $E$ may be updated in constant time per operation, and the
partition into $H$ and $V \setminus H$ may be maintained as described in the previous sections
in constant time per operation. Thus, it remains to describe how to update $P$. If we
are inserting an edge $uv$, and $u$ does not belong to $H$, it has at most $2h$ neighbors; we
examine all other neighbors $w$ of $u$ and for each such neighbor increment the counter in
$P[v, w]$ (or create a new entry in $P[v, w]$ with a count of 1 if no such entry already exists).
Similarly if $v$ does not belong to $H$ we examine all other neighbors $w$ of $v$ and for each
such neighbor increment $P[u, w]$. If we are deleting an edge, we similarly decrement the
counters or remove the entry for a counter if decrementing it would leave a zero value.
Each update involves incrementing or decrementing $O(h)$ counters and therefore may
be implemented in $O(h)$ time.

Finally, a change to the graph may lead to a change in $H$, which must be reflected in
$P$. If a vertex $v$ is moved from $H$ to $V \setminus H$, we examine all pairs $u, w$ of neighbors of $v$ and
increment the corresponding counts in $P[u, w]$, and if a vertex $v$ is moved from $V \setminus H$ to
$H$ we examine all pairs $u, w$ of neighbors of $v$ and decrement the corresponding counts
in $P[u, w]$. This step takes time $O(h^2)$, because $v$ has $O(h)$ neighbors when it is moved
in either direction, but as per the analysis in Section 3 it is performed an average of
$O(1/h)$ times per operation, so the amortized time for updates of this type, per change
to the input graph, is $O(h)$.

The space for the data structure is $O(m)$ for $E$, $O(n)$ for the data structure that
maintains $H$, and $O(mh)$ for $P$ because each edge of the graph belongs to $O(h)$
two-edge paths through low-degree vertices. □
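
The update rules in this proof can be sketched in Python. The sketch below is a
simplification and not the data structure of Theorem 3 itself: it replaces the gradual
h-partition of Section 3 by a fixed degree threshold (the hypothetical parameter
`threshold`), so the counts stay correct but the amortized $O(h)$ bound is not claimed.
The class and method names are likewise assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

class TriangleCounter:
    def __init__(self, threshold):
        self.t = threshold             # assumed stand-in for the h-partition rule
        self.adj = defaultdict(set)    # adjacency sets; also answers edge queries (E)
        self.high = set()              # the set H of high-degree vertices
        self.paths = {}                # P[{a, b}]: two-edge paths a-x-b with x not in H
        self.triangles = 0

    def _add(self, a, b, delta):
        key = frozenset((a, b))
        self.paths[key] = self.paths.get(key, 0) + delta
        if self.paths[key] == 0:       # keep only nonzero counters
            del self.paths[key]

    def _through(self, u, v):
        """Number of vertices w adjacent to both u and v."""
        via_high = sum(1 for w in self.high
                       if w in self.adj[u] and w in self.adj[v])
        return via_high + self.paths.get(frozenset((u, v)), 0)

    def _shift_paths(self, x, delta):
        """Add delta to P[a, b] for every pair a, b of neighbors of x."""
        for a, b in combinations(self.adj[x], 2):
            self._add(a, b, delta)

    def _rebalance(self, x):
        is_high = len(self.adj[x]) > self.t
        if is_high and x not in self.high:      # x moves from V \ H to H
            self._shift_paths(x, -1)
            self.high.add(x)
        elif not is_high and x in self.high:    # x moves from H to V \ H
            self.high.discard(x)
            self._shift_paths(x, +1)

    def insert_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return
        self.triangles += self._through(u, v)
        for x, y in ((u, v), (v, u)):
            if x not in self.high:              # new paths y-x-w through low x
                for w in self.adj[x]:
                    self._add(y, w, +1)
        self.adj[u].add(v)
        self.adj[v].add(u)
        self._rebalance(u)
        self._rebalance(v)

    def delete_edge(self, u, v):
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        for x, y in ((u, v), (v, u)):
            if x not in self.high:              # paths y-x-w through low x vanish
                for w in self.adj[x]:
                    self._add(y, w, -1)
        self.triangles -= self._through(u, v)
        self._rebalance(u)
        self._rebalance(v)


tc = TriangleCounter(threshold=2)
for a, b in combinations(range(4), 2):          # build the complete graph K4
    tc.insert_edge(a, b)
print(tc.triangles)
tc.delete_edge(0, 1)
print(tc.triangles)
```

As in the proof, triangles whose third vertex is in `high` are found by scanning that set,
and all remaining triangles come from a single lookup in `paths`.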

5 Subgraph Multiplicity 

Although the data structure of Theorem 3 only counts the number of triangles in a
graph, it is possible to use it to count the number of three-vertex subgraphs of all types,
or the number of induced three-vertex subgraphs of all types. In what follows we let
$p_i = p_i(G)$ denote the number of paths of length $i$ in $G$, and we let $c_i = c_i(G)$ denote
the number of cycles of length $i$ in $G$.

The set of all edges in a graph $G$ among a subset of three vertices $\{u, v, w\}$ determines
one of four possible induced subgraphs: an independent set with no edges, a graph with
a single edge, a two-star consisting of two edges, or a triangle. Let $g_0$, $g_1$, $g_2$, and $g_3$
denote the numbers of three-vertex subgraphs of each of these types, where $g_i$ counts
the three-vertex induced subgraphs that have $i$ edges.

Observe that it is trivial to maintain for a dynamic graph, in constant time per
operation, the three quantities $n$, $m$, and $p_2$, where $n$ denotes the number of vertices of the
graph, $m$ denotes the number of edges, and $p_2$ denotes the number of two-edge paths
that can be formed from the edges of the graph. Each change to the graph increments
or decrements $n$ or $m$. Additionally, adding an edge $uv$ to a graph where $u$ and $v$ already
have $d_u$ and $d_v$ incident edges respectively increases $p_2$ by $d_u + d_v$, while removing an
edge $uv$ decreases $p_2$ by $d_u + d_v - 2$. Letting $c_3$ denote the number of triangles in the
graph as maintained by Theorem 3, the quantities described above satisfy the matrix
equation

$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 1 & 3 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} g_0 \\ g_1 \\ g_2 \\ g_3 \end{pmatrix} =
\begin{pmatrix} n(n-1)(n-2)/6 \\ m(n-2) \\ p_2 \\ c_3 \end{pmatrix}.$$

Each row of the matrix corresponds to a single linear equation in the $g_i$ values. The
equation from the first row, $g_0 + g_1 + g_2 + g_3 = \binom{n}{3}$, can be interpreted as stating that all
triples of vertices form one graph of one of these types. The equation from the second
row, $g_1 + 2g_2 + 3g_3 = m(n-2)$, is a form of double counting where the number of edges
in all three-vertex subgraphs is added up on the left hand side by subgraph type and on
the right hand side by counting the number of edges ($m$) and the number of triples each
edge participates in ($n - 2$). The third row's equation, $g_2 + 3g_3 = p_2$, similarly counts
incidences between two-edge paths and triples in two ways, and the fourth equation
$g_3 = c_3$ follows since each three vertices that are connected in a triangle cannot form
any other induced subgraph than a triangle itself.

By inverting the matrix we may reconstruct the $g$ values:

$$\begin{aligned}
g_3 &= c_3 \\
g_2 &= p_2 - 3g_3 \\
g_1 &= m(n-2) - (2g_2 + 3g_3) \\
g_0 &= \binom{n}{3} - (g_1 + g_2 + g_3).
\end{aligned}$$

Thus, we may maintain each number of induced subgraphs $g_i$ in the same asymptotic
time per update as we maintain the number of triangles in our dynamic graph. The
numbers of subgraphs of different types that are not necessarily induced are even easier
to recover: the number of three-vertex subgraphs with $i$ edges is given by the $i$th entry
of the vector on the right hand side of the matrix equation.
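
The back-substitution can be written out directly. In the Python sketch below, the
function names `induced_counts`, `two_edge_paths` and `brute_force` and the sample graph
are assumptions made for illustration; the brute-force routine exists only to
cross-check the formulas on a small input.

```python
from itertools import combinations
from math import comb

def induced_counts(n, m, p2, c3):
    """Recover (g0, g1, g2, g3) from n, m, p2 and c3 by back-substitution."""
    g3 = c3
    g2 = p2 - 3 * g3
    g1 = m * (n - 2) - (2 * g2 + 3 * g3)
    g0 = comb(n, 3) - (g1 + g2 + g3)
    return g0, g1, g2, g3

def two_edge_paths(n, edges):
    """p2: every vertex of degree d is the middle of d(d-1)/2 two-edge paths."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(comb(d, 2) for d in degree)

def brute_force(n, edges):
    """Count induced three-vertex subgraphs by their number of edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(range(n), 3):
        k = sum(1 for pair in combinations(triple, 2) if frozenset(pair) in edge_set)
        counts[k] += 1
    return tuple(counts)

# Sample graph: a triangle 0-1-2 with a pendant edge 2-3 and an isolated vertex 4.
n, edges = 5, [(0, 1), (1, 2), (0, 2), (2, 3)]
c3 = brute_force(n, edges)[3]        # stands in for the maintained triangle count
print(induced_counts(n, len(edges), two_edge_paths(n, edges), c3))
print(brute_force(n, edges))
```

In a dynamic setting $p_2$ and $c_3$ would be maintained incrementally as described above
instead of being recomputed from scratch.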

As we detail in an appendix, it is also possible to maintain efficiently the numbers
of star subgraphs of a dynamic graph, and the number of four-vertex paths in a dynamic
graph.

6 Weighted Edges and Colored Vertices



It is possible to generalize our triangle counting method to problems of weighted
triangle counting: we assign each edge $uv$ of the graph a weight $w_{uv}$, define the weight of a
triangle to be the product of the weights of its edges, and maintain the total weight of
all triangles. For instance, if $0 \le w_{uv} \le 1$ and each edge is present in a subgraph with
probability $w_{uv}$, then the total weight gives the expected number of triangles in that
subgraph.

Theorem 4. The total weight of all triangles in a weighted dynamic graph, as
described above, may be maintained in time $O(h)$ per update.

Proof. We modify the structure $P[u, v]$ maintained by our triangle-finding data structure,
so that it stores the weight of all two-edge paths from $u$ to $v$. Each update of an edge
$uv$ in our structure involves a set of individual triangles $uvx$ involving vertices $x \in H$
(whose weight is easily calculated) together with the triangles formed by paths counted
in $P[u, v]$ (whose total weight is $P[u, v]\, w_{uv}$). The same time analysis from Theorem 3
holds for this modified data structure. □

For social networking ERGM applications, an alternative generalization may be ap- 
propriate. Suppose that the vertices of the given dynamic graph are colored; we wish to 
maintain the number of triangles with each possible combination of colors. For instance, 
in graphs representing sexual contacts [24], edges between individuals of the same sex 
may be less frequent than edges between individuals of opposite sexes; one may model 
this in an ERGM by assigning the vertices two different colors according to whether 
they represent male or female individuals and using feature weights that depend on the 
colors of the vertices in the features. As we now show, problems of counting colored 
triangles scale well with the number of different groups into which the vertices of the 
graph are classified. 

Theorem 5. Let $G$ be a dynamic graph in which each vertex is assigned one of $k$
different colors. Then we may maintain the numbers of triangles in $G$ with each possible
combination of colors, in time $O(h + k)$ per update.

Proof. We modify the structure $P[u, v]$ stored by our triangle-finding data structure, to
store a vector of $k$ numbers: the $i$th entry in this vector records the number of
two-edge paths from $u$ to $v$ through a low-degree vertex with color $i$. Each update of an
edge $uv$ in our structure involves a set of individual triangles $uvx$ involving vertices
$x \in H$ (whose colors are easily observed) together with the triangles formed by paths
counted in $P[u, v]$ (with $k$ different possible colorings, recorded by the entries in the
vector $P[u, v]$). Thus, the part of the update operation in which we compute the numbers
of triangles for which the third vertex has low degree, by looking up $u$ and $v$ in $P$, takes
time $O(k)$ instead of $O(1)$. The same time analysis from Theorem 3 holds for all other
aspects of this modified data structure. □

Both the weighting and coloring generalizations may be combined with each other
without loss of efficiency.

7 How Small is the h-Index of Typical Graphs?



It is straightforward to identify the graphs with extremal values of the h-index. A split
graph in which an $h$-vertex clique is augmented by adding $n - h$ vertices, each connected
only to the vertices in the clique, has $n$ vertices and $m = h(n-1)$ edges, achieving an
h-index of $m/(n-1)$. This is the minimum possible among any graph with $n$ vertices
and $m$ edges: any other graph may be transformed into a split graph of this type, while
increasing its number of edges and not decreasing $h$, by finding an h-partition $(H, V \setminus H)$
and repeatedly replacing edges that do not have an endpoint in $H$ by edges that do have
such an endpoint. The graph with the largest h-index is a clique with $m$ edges together
with enough isolated vertices to fill out the total to $n$; its h-index is $\sqrt{2m}\,(1 + o(1))$.
Thus, for sparse graphs in which the numbers of edges and vertices are proportional to
each other, the h-index may be as small as $O(1)$ or as large as $\Omega(\sqrt{n})$. At which end of
this spectrum can we expect to find the graphs arising in social network analysis?

One answer can be provided by fitting mathematical models of the degree
distribution, the relation between the number of incident edges at a vertex and the number of
vertices with that many edges, to social networks. For many large real-world graphs,
observers have reported power laws in which the number of vertices with degree $d$ is
proportional to $n d^{-\gamma}$ for some constant $\gamma > 1$; a network with this property is called
scale-free [2, 24, 26, 27]. Typically, $\gamma$ lies in or near the interval $2 < \gamma < 3$ although more
extreme values are possible. The h-index of these graphs may be found by solving for
the $h$ such that $h = n h^{-\gamma}$; that is, $h = \Theta(n^{1/(1+\gamma)})$. For any $\gamma > 1$ this is an asymptotic
improvement on the worst-case $O(\sqrt{n})$ bound for graphs without power-law degree
distributions. For instance, for $\gamma = 2$ this would give a bound of $h = O(n^{1/3})$ while for
$\gamma = 3$ it would give $h = O(n^{1/4})$. That is, by depending on the h-index as it does, our
algorithm is capable of taking advantage of the extra structure inherent in scale-free
graphs to run more quickly for them than it does in the general case.
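
The power-law estimate can be written out step by step. The derivation below ignores
constants of proportionality, and the value $n = 10^6$ is an illustrative assumption
rather than a measurement from any data set.

```latex
\begin{align*}
h = n\,h^{-\gamma}
  &\;\Longrightarrow\; h^{1+\gamma} = n
   \;\Longrightarrow\; h = n^{1/(1+\gamma)}, \\
\gamma = 2 &:\quad h = n^{1/3}, \qquad \text{e.g. } n = 10^{6} \text{ gives } h = 10^{2}, \\
\gamma = 3 &:\quad h = n^{1/4}, \qquad \text{e.g. } n = 10^{6} \text{ gives } h = 10^{1.5} \approx 31.6, \\
\text{compare} &:\quad \sqrt{n} = 10^{3} \text{ for } n = 10^{6}.
\end{align*}
```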

To further explore h-index behavior in real-world networks, we computed the h-index
for a collection of 136 network data sets typical of those used in social network
analysis. These data sets were drawn from a variety of sources traditionally viewed as 
common repositories for such data. The majority of our data sets were from the well 
known Pajek datasets [4]. Pajek is a program used for the analysis and visualization of 
large networks. The collection of data available with the Pajek software includes citation
networks, food webs, friendship networks, etc. In addition to the Pajek data sets, we
included network data sets from UCINET [5]. Another software package developed for 
network analysis, UCINET includes a corpus of data sets that are more traditional in 
the social sciences. Many of these data sets represent friendship or communication rela- 
tions; UCINET also includes various social networks for non-human animals. We also 
used network data included as part of the statnet software suite [19], statistical model- 
ing software in R. statnet includes ERGM functionality, making it a good example for 
data used specifically in the context of ERG models. Finally, we included data available 
on the UCI Network Data Repository [8], including some larger networks such as the 
WWW, blog networks, and other online social networks. By using this data we hope to
understand how the h-index scales in real-world networks.

Details of the statistics for these networks are presented in an appendix; a summary
of the statistics for network size and h-index are in Table 1 below. For this sample of
136 real-world networks, the h-index ranges from 2 to 116. The row of summary
statistics for $\log h / \log n$ suggests that, for many networks, $h$ scales as a sublinear power of $n$.
The one case with an h-index of 116 represents the ties among Slovenian magazines
and journals between 1999 and 2000. The vertices of this network represent journals,
and undirected edges between journals have an edge weight that represents the
number of shared readers of both journals; this network also includes self-loops describing
the number of all readers that read this journal. Thus, this is a dense graph, more
appropriately handled using statistics involving the edge weights than with combinatorial
techniques involving the existence or nonexistence of triangles. However, this is the
only network from our dataset with an h-index in the hundreds. Even with significantly
larger networks, the h-index appears to scale sublinearly in most cases.

Table 1. Summary statistics for real-world network data



A histogram of the h-index data in Figure 1 clearly shows a bimodal distribution.
Additionally, as the second peak of the bimodal distribution corresponds to a scaling
exponent greater than 0.5, the graphs corresponding to that peak do not match the
predictions of the scale-free model. However, we were unable to discern a pattern to the
types of networks with smaller or larger h-indices, and do not speculate on the reasons
for this bimodality. We look more deeply at the scaling of the h-index using standard
regression techniques in an appendix.

Fig. 1. A frequency histogram for $\log h / \log n$.

8 Discussion



We have defined an interesting new graph invariant, the h-index, presented efficient
dynamic graph algorithms for maintaining the h-index and, based on them, for
maintaining the set of triangles in a graph, and studied the scaling behavior of the h-index
both on theoretical scale-free graph models and on real-world network data.

There are many directions for future work. For sparse graphs, the h-index may be 
larger than the arboricity, a graph invariant used in static subgraph isomorphism [6, 12]; 
can we speed up our dynamic algorithms to run more quickly on graphs of bounded 
arboricity? We handle undirected graphs but the directed case is also of interest. We 
would like to find efficient data structures to count larger subgraphs such as 4-cycles, 
4-cliques, and claws; dynamic algorithms for these problems are likely to be slower than 
our triangle-finding algorithms but may still provide speedups over static algorithms. 
Another network statistic related to triangle counting is the clustering coefficient of a 
graph; can we maintain it efficiently? Additionally, there is an opportunity for additional 
work in implementing our data structures and testing their efficiency in practice. 
Appendix I: Proof of Theorems 1 and 2 



We begin by proving Theorem 1, the correctness of our data structure for maintaining 
the h-index and h-partition, and the analysis showing that it takes constant time per 
operation. 

Proof. The time analysis follows immediately from the description of the data structure 
update operations. These updates maintain as invariants the properties of the set B and 
the dictionary of sets C[i]: that they partition S properly by their values of f(x), that 
B consists exactly of those elements of H with f(x) = |H|, and that H consists of B 
together with those elements of S with f(x) > |H|. 

Thus, h = |H| has the property that there exists a set (namely H) with h elements, 
all of which have function value at least h. There can be no larger h' with the same 
property, because all of the elements with value greater than h belong to H already, so 
there can be no larger set of elements with larger values. Thus, h is the correct h-index 
of S and f, and (H, S \ H) is a correct h-partition. □ 
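
The defining property used in this proof can be checked directly on small inputs. The following Python sketch is a static, sorting-based illustration rather than the constant-time dynamic structure analyzed above; the names h_index and h_partition and the sample values of f are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


def h_partition(f):
    """Split the keys of f into (H, rest), where H holds h keys with f[x] >= h."""
    h = h_index(f.values())
    ranked = sorted(f, key=f.get, reverse=True)
    return set(ranked[:h]), set(ranked[h:])


# Hypothetical function values on a five-element set S.
f = {"a": 4, "b": 3, "c": 3, "d": 1, "e": 1}
H, rest = h_partition(f)
```

For these values h = 3 and H = {a, b, c}: every member of H has value at least 3, and no set of four elements has all values at least 4.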

Next we prove Theorem 2, the time analysis of our data structure for maintaining a 
partition of a graph into low and high degree vertices with a very low number of moves 
of vertices from one part of the partition to the other. 

As an accounting technique for the analysis of the algorithm (not something actually 
stored within our data structure) we associate a (fractional) number of "credits" 
with each member of P, which is zero when that element is added to P. Each increment 
operation adds credit to each current member of P, and each decrement operation 
on a member of P adds 1/|H| credits to that member. 

Lemma 1. Any sequence of operations during which |H| changes from h to h' > h 
includes at least (h' - h)^2 increment operations. 

Proof. There exist at least h' - h members of the set H after the sequence that were 
not members prior to the sequence. Each of these elements has f(x) ≤ h prior to the 
sequence (else it would belong to H) and f(x) ≥ h' after the sequence, so each was 
incremented at least h' - h times, and the number of increments for these elements 
alone must have been at least (h' - h)^2. □ 

Lemma 2. Any element x that is removed from P must have accumulated Ω(1) credits. 

Proof. Let h be the value of |H| at the time x was added to P, and h' be the value 
of max(h, |H|) at the time it is removed. Then by the previous lemma, x must have 
accumulated (h' - h)^2/h^2 credits from increment operations, and Ω((2h - h')/h) credits 
from decrement operations. But for any h' ≥ h, (h' - h)^2/h^2 + (2h - h')/h = Ω(1). □ 
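
The last step of this proof can be made explicit. Reading the two credit terms as (h' - h)^2/h^2 and (2h - h')/h, and writing t = h'/h (a substitution introduced here only for illustration), completing the square gives a concrete constant:

```latex
\[
\frac{(h'-h)^2}{h^2} + \frac{2h-h'}{h}
  = (t-1)^2 + (2-t)
  = t^2 - 3t + 3
  = \Bigl(t - \tfrac{3}{2}\Bigr)^2 + \tfrac{3}{4}
  \;\ge\; \tfrac{3}{4}.
\]
```

Under this reading the sum is smallest at h' = 3h/2, where it equals 3/4.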

The proof of Theorem 2 now follows. 

Proof. The number of additions is equal to the number of removals, plus the number of 
items that remain in H at the end of the sequence. But by Lemma 1 we can find a 
subsequence I of increase operations such that the final value of |H| is O(Σ_{i∈I} 1/h_i). Thus, 
we need count only the number of times elements are removed from P. By Lemma 2 
this number of removals is proportional to the total number of credits that have been 
accumulated by all elements over the course of the sequence. But, since each operation 
assigned at most 1/h_i credits, this total is at most q. □ 
Appendix II: Additional subgraph counting data structures 

If s_i = s_i(G) denotes the number of star subgraphs K_{1,i} in G, we may maintain s_i, for 
any constant i, in constant time per update, as it is a sum of polynomials of the vertex 
degrees: s_i = Σ_v d_v (d_v - 1) ... (d_v - i + 1) / i!. For instance, the number of claws 
(three-leaf stars) in G is s_3 = Σ_v d_v (d_v - 1)(d_v - 2) / 6. In at least one other nontrivial 
case we may maintain the number of four-vertex subgraphs of a certain type as efficiently 
as the number of triangles. 
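
The constant-time update for star counts follows from the identity C(d+1, i) - C(d, i) = C(d, i-1): raising one degree from d to d + 1 changes s_i by a single binomial coefficient. A minimal Python sketch, assuming degrees are kept in a dictionary; the helper names star_count, insert_edge, and delete_edge and the four-vertex example are hypothetical.

```python
from math import comb


def star_count(degrees, i):
    """Number of stars K_{1,i}: choose i of the d_v edges at each center v."""
    return sum(comb(d, i) for d in degrees.values())


def insert_edge(degrees, stars, u, v):
    """Update degrees and the running counts stars[i] for a new edge uv."""
    for x in (u, v):
        d = degrees[x]
        for i in stars:
            # comb(d + 1, i) - comb(d, i) == comb(d, i - 1)
            stars[i] += comb(d, i - 1)
        degrees[x] = d + 1


def delete_edge(degrees, stars, u, v):
    """Inverse of insert_edge, for an edge uv that is currently present."""
    for x in (u, v):
        d = degrees[x] - 1
        for i in stars:
            stars[i] -= comb(d, i - 1)
        degrees[x] = d


# A claw centered at a, plus the edge bc.
degrees = {x: 0 for x in "abcd"}
stars = {2: 0, 3: 0}  # tracked values of i (here two-leaf and three-leaf stars)
for u, v in [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]:
    insert_edge(degrees, stars, u, v)
```

Each update touches only the two endpoint degrees, so its cost depends on the number of tracked values of i and not on the size of the graph.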

Theorem 6. We may maintain a dynamic graph subject to edge insertions and deletions 
and to insertions and deletions of isolated vertices, and keep track of the number p_4 of 
four-vertex paths in the graph, in amortized time O(h) per update where h is the h-index 
of the graph at the time of an update. 

Proof. Let q denote the number of sequences of three edges that form either a path or 
a cycle in G. Let d_v denote the degree of v (that is, its number of incident edges), and 
let P_v denote the number of two-edge paths having v as an endpoint (that is, 
P_v = Σ_w (d_w - 1), where the sum is over all neighbors w of v in G). 

Inserting an edge uv into the graph G increases q by d_u d_v + P_u + P_v: the term d_u d_v 
counts the paths with uv as middle edge, and the other two terms count the paths having 
v or u as endpoint. Similarly, removing edge uv decreases q by (d_u - 1)(d_v - 1) + 
(P_u - d_v + 1) + (P_v - d_u + 1). Thus, if we can calculate P_u and P_v, we can correctly update q. 

Our data structure stores the numbers d_v for each vertex v, and the numbers P_u 
only for those vertices u that belong to the set H maintained by the gradual partition of 
Section 3. When a vertex is added to H, the value P_u stored for it may be computed in 
time O(h). When we insert or delete an edge uv, the numbers P_u and P_v that we need 
to use to update q may be found either by looking them up in this data structure (if the 
endpoints u or v of the updated edge belong to H) or in time O(h) by looking at all 
neighbors of the endpoints if they do not belong to H. Finally, whenever we insert or 
delete an edge uv, we must update the numbers P_w for all vertices w belonging to H, 
where either w is one of the two endpoints u and v or it is adjacent to one or both of 
these endpoints; this update may be performed in constant time per member of H, or 
O(h) time total. 

The number of four-vertex paths that we maintain is then p_4 = q - 3c_3 where c_3 
denotes the number of triangles in the graph as maintained by our other structures. □ 
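
The update rules in this proof can be exercised on a small graph. The following Python sketch is a simplification and not the O(h) structure of the theorem: it recomputes P_v from the neighbors of v on every update and counts triangles through common neighbors. The class name PathCounter, its methods, and the example graph are hypothetical.

```python
from itertools import permutations


class PathCounter:
    """Tracks q and the triangle count c3 under edge insertions and deletions."""

    def __init__(self):
        self.adj = {}  # vertex -> set of neighbors
        self.q = 0     # three-edge paths, plus each triangle counted three times
        self.c3 = 0    # number of triangles

    def _p(self, v):
        # P_v: sum of (d_w - 1) over the neighbors w of v.
        return sum(len(self.adj[w]) - 1 for w in self.adj[v])

    def insert(self, u, v):
        """Insert edge uv, assumed absent and with u != v."""
        nu = self.adj.setdefault(u, set())
        nv = self.adj.setdefault(v, set())
        # All quantities are read before uv is added.
        self.q += len(nu) * len(nv) + self._p(u) + self._p(v)
        self.c3 += len(nu & nv)
        nu.add(v)
        nv.add(u)

    def delete(self, u, v):
        """Delete edge uv, assumed present."""
        nu, nv = self.adj[u], self.adj[v]
        du, dv = len(nu), len(nv)
        # All quantities are read while uv is still present.
        self.q -= (du - 1) * (dv - 1) + (self._p(u) - dv + 1) + (self._p(v) - du + 1)
        nu.remove(v)
        nv.remove(u)
        self.c3 -= len(nu & nv)

    def four_vertex_paths(self):
        return self.q - 3 * self.c3


def brute_force_paths(adj):
    """Count four-vertex paths by enumeration; each is found in both directions."""
    found = sum(
        1
        for a, b, c, d in permutations(adj, 4)
        if b in adj[a] and c in adj[b] and d in adj[c]
    )
    return found // 2


# A 4-cycle a-b-c-d with the diagonal ac.
g = PathCounter()
for u, v in [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("a", "c")]:
    g.insert(u, v)
```

On this example q = 12 and c_3 = 2, so the maintained count is 12 - 3·2 = 6 four-vertex paths, which agrees with brute_force_paths(g.adj).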

The counts of larger subgraphs in G obey additional linear relations: for instance, 
Y^yPy — P4 + 2p2 + 3^3 +4c4. However we have not been able to exploit these relations 
by finding efficient algorithms for maintaining the quantities p4 and C4. 
Appendix III: Detailed analysis of real-world network data 



We calculated the h-index of the networks in our sample in R, using a subroutine 
provided by Carter Butts. The data that results from this calculation is plotted in Figure 2. 




Fig. 2. Scatter plot of h-index and network size 



Figure 2 suggests that the data might be more appropriately viewed on a log-log 
scale. This plot is seen in Figure 3. 

8.1 Quantile regression 

To find an upper bound on the scaling of the h-index of our real-world networks we 
clustered the data into two groups, and used quantile regression to fit the data with 
curves of the form log h = β_0 + β_1 log n, at the 95th percentile. That is, we are looking 
for a power law h ~ c n^(β_1), and we want 95% of the graphs to have an h-index no larger 
than the one predicted by this law. We fit a law of this type to the two clusters separately 
to provide a more conservative and substantive prediction. The resulting regression lines 
are reported in Table 2. Corresponding goodness of fit measures are also reported in 
Table 3. We note that these are conservative estimates and the actual scaling is likely 
better. 
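
To make the fitting criterion concrete, the following Python sketch fits a line at a chosen quantile by minimizing the usual "pinball" loss. It is a brute-force stand-in and not the R routine used for the analysis. It relies on the fact that the problem is a linear program in the two coefficients, so some optimal line passes through two data points. The function names and the sample (n, h) pairs are hypothetical and are not taken from our dataset.

```python
from itertools import combinations
from math import log


def pinball_loss(residual, tau):
    """Quantile loss: tau * r for r >= 0, and (tau - 1) * r for r < 0."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def quantile_line(points, tau):
    """Return (b0, b1) minimizing the total pinball loss of y - (b0 + b1 * x).

    Tries every line through two data points with distinct x values;
    assumes at least one such pair exists.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        loss = sum(pinball_loss(y - (b0 + b1 * x), tau) for x, y in points)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


# Hypothetical networks lying exactly on the power law h = n^0.5.
sample = [(16, 4), (64, 8), (256, 16), (1024, 32)]
points = [(log(n), log(h)) for n, h in sample]
b0, b1 = quantile_line(points, 0.95)
```

With tau = 0.95, points above the fitted line are penalized nineteen times as heavily as points below it, which pushes the line toward the upper envelope of the data. The brute-force search examines every pair of points and recomputes the loss for each, which is adequate for a sample of this size but not for large inputs.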
Fig. 3. Scatter plot of h-index and network size, on log-log scale 
Table 2. Coefficients for quantile regression lines 
Fig. 4. h-index scaling using quantile regression fits 
Table 3. Goodness of fit measures for quantile regression lines 
Appendix IV: Raw data from analysis of real-world networks 